Data science, intelligence and future analysis
Mozhdeh Salari; Reza Radfar; Mahdi Faghihi
Abstract
The purpose of this research is to identify the factors that are effective in predicting the academic performance of undergraduate students in a four-class classification setting. To achieve this goal, the study follows the CRISP data mining methodology. The data set was extracted from the NAD educational system for bachelor's degree students of Shahed University who entered between 2011 and 2021, and 1468 records were used in data mining. First, the features that affect students' academic performance were extracted. Modeling was done using RapidMiner 9.9. To improve classification performance and reach satisfactory prediction accuracy, principal component analysis was combined with machine learning algorithms, feature selection techniques and optimization algorithms. The performance of the prediction models was verified using 10-fold cross-validation. The results showed that the decision tree algorithm was the best algorithm for predicting students' performance, with an accuracy of 84.71%. Based on the final GPA, this algorithm correctly predicted the graduation of 77.88% of excellent students, 85.26% of good students, 84.69% of medium students, and 85.95% of weak students.
Introduction
The main problem in this research is to identify the factors that are effective in predicting the academic performance of undergraduate students at Shahed University. A second problem is choosing the best machine learning algorithm for predicting academic performance among the different modeling methods, based on validation and evaluation of the models.
The purpose of this research is to investigate the factors that are effective in predicting the academic performance of undergraduate students at Shahed University, using educational data mining based on classification models.
Research questions
The main question in this research is: what factors affect the prediction of undergraduate students' performance and the improvement of that performance?
Sub questions
1- Which modeling algorithms have better results in predicting student performance?
2- What methods have been used to predict students' performance?
3- What is the validity of the developed model for Shahed University students?
2- Research background
2-1- Theoretical foundations
Educational data mining
The processing of educational data improves the prediction of student behavior and supports new approaches to educational policies (Capuano & Toti, 2019; Viberg et al., 2018).
Academic performance
Academic performance of students means the extent to which they achieve educational goals (Banik & Kumar, 2019).
2-2- Review of past studies
The highlighted cells in Table 1, based on past research, show the classification algorithms that were the most accurate and effective in predicting students' performance in each study. The decision tree algorithm has been used the most in previous research, followed by the NB algorithm. RF and ANN are next in frequency of use, followed by SVM and KNN.
Table 1.
The results of the research literature based on the use of classification algorithms
Columns: Data mining algorithm (DT, RF, NB, KNN, SVM, ANN, Line R, LLR), Accuracy
(Batool et al., 2023) * *
(Marjan et al., 2023)******
(Abdelmagid & Qahmash, 2023) * ** *
(Manoharan et al., 2023)** * * *
(Alghamdi & Rahman, 2023)*** 99.34%
(Alboaneen et al., 2022) * ****
(Yağcı, 2022)* *** * 70-75%
(Dabhade et al., 2021)* * * 83.44%
(Najafi et al., 2021)* 95%
(Soltani et al., 2021)* **
(Cruz-Jesus et al., 2020) * ** * 50-81%
(Sokkhey & Okazaki, 2020)*** *
(Rebai et al., 2020)**
(Jayaprakash et al., 2020)***
(Zulfiker et al., 2020)** *
(Musso et al., 2020) *
(Waheed et al., 2020) * 85%
(Salal & Abdullaev, 2019)* ****
(Turabieh, 2019)* ** *
(Xu et al., 2019)* **
(Ghodoosi et al., 2019)* *
(Fadavi et al., 2019) * 95.84%
(Ajibade et al., 2019)* *** 91.5%
(Ahmad & Shahzadi, 2018) * 85%
(Hasani & Bazrafshan, 2018)* *
(Hussain et al., 2018)*** *
(Umer et al., 2017)**** *
(Khasanah, 2017)* *
(Asif et al., 2017)*
(Hoffait & Schyns, 2017) * * * 92.34%
(Khosravi et al., 2017)* *
(Mueen et al., 2016)* * * 86%
(Amrieh et al., 2015)* **
(Yehuala, 2015)* * 92.34%
(Zahedi et al., 2015)* * *
(Punlumjeak & Rachburee, 2015)*
(Osmanbegović et al., 2014)** 71%
(Shamloo et al., 2014)*
(Asadi et al., 2013)*
(Kabakchieva, 2013)* ** 60-75%
(Oskouei & Askari, 2014)*** * 96%
(Nghe et al., 2007)* *
Present research****** 94.17%
3- Method
This study follows the widely used CRISP data mining methodology. The data set was extracted from the NAD educational system and covers bachelor's degree students in non-medical fields of Shahed University who entered between 2011 and 2021. We used the Label Encoder technique to encode the features. The C4.5 and ID3 decision tree classification algorithms, random forest, Naïve Bayes, k-nearest neighbor, artificial neural network and gradient boosted trees were used to analyze and classify students and to predict the final GPA. Modeling was done using RapidMiner 9.9.
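The label-encoding step mentioned in this section can be illustrated with a minimal sketch. This is not the procedure used in the study, which was carried out in RapidMiner; it only shows the idea of replacing each category with an integer code, assigned here in sorted order of the category names. The feature name admission_type and its values are hypothetical and do not come from the study's data set, and the helper names are illustrative.

```python
def fit_label_encoder(values):
    """Build a category -> integer mapping, with codes in sorted category order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def transform(values, mapping):
    """Replace every category by its integer code."""
    return [mapping[value] for value in values]


# Hypothetical categorical feature (not taken from the study's data).
admission_type = ["regular", "quota", "regular", "transfer", "quota"]

mapping = fit_label_encoder(admission_type)
encoded = transform(admission_type, mapping)

print(mapping)
print(encoded)
```

Integer codes of this kind imply an ordering that the categories may not have. Tree-based classifiers such as C4.5 are usually tolerant of that, while distance-based methods such as k-nearest neighbor can be affected by it.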
To improve classification performance and reduce misclassification, we combined principal component analysis with feature selection techniques and optimization algorithms. Prediction accuracy was evaluated using 10-fold cross-validation for all algorithms. The algorithms were then compared descriptively on the basis of the evaluation criteria, and the best prediction model was selected.
4- Data analysis
4-1 Introduction
The best model is the one that has the best values for the selected performance measurement criteria (Lever et al., 2016). Figure 1 compares the accuracy of the algorithms used in this research.
Figure 1. Comparative chart of the accuracy of the algorithms
According to Table 2, the DT C4.5 algorithm correctly predicts the class of 1235 of the 1458 records, which gives an accuracy of 84.71%.
Table 2. Confusion matrix of the DT C4.5-GI&OSE research model
Rows give the predicted class; columns give the students' actual performance class.
                            Excellent    Good     Average   Poor     Precision
Prediction 1 (excellent)    81           22       0         0        78.64%
Prediction 2 (good)         22           295      49        9        78.67%
Prediction 3 (average)      1            27       498       50       86.46%
Prediction 4 (poor)         0            2        41        361      89.36%
Recall                      77.88%       85.26%   84.69%    85.95%
4-2 Important features
The predictive variables, ranked by weight, are as follows:
Diploma GPA: 0.262
Semester 1 GPA: 0.201
Semester 2 GPA: 0.197
Number of honors semesters: 0.122
Number of conditional semesters: 0.114
Year of entry: 0.104
4-3 The results of the implementation of the student performance prediction model
The results of the prediction model are shown in Table 3:
Table 3. The results of the DT C4.5-GI&OSE model implementation
5- Discussion
In the main method of this research, DT C4.5-GI&OSE, in the four-class classification mode, the diploma GPA has the greatest effect on predicting student performance.
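As a check on the figures in Table 2, accuracy, per-class precision and recall, and Cohen's kappa can be recomputed from the confusion matrix counts alone. The sketch below is not the RapidMiner procedure used in the study. It assumes that rows are predictions and columns are actual classes, both ordered excellent, good, average, poor, and the names LABELS, CONFUSION and summarize are illustrative.

```python
# Counts as read from Table 2: rows = predicted class, columns = actual class,
# both in the order excellent, good, average, poor.
LABELS = ["excellent", "good", "average", "poor"]
CONFUSION = [
    [81, 22, 0, 0],     # predicted excellent
    [22, 295, 49, 9],   # predicted good
    [1, 27, 498, 50],   # predicted average
    [0, 2, 41, 361],    # predicted poor
]


def summarize(matrix):
    """Return accuracy, per-class precision, per-class recall and Cohen's kappa."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    predicted_totals = [sum(row) for row in matrix]
    actual_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]

    accuracy = correct / total
    precision = [matrix[i][i] / predicted_totals[i] for i in range(n)]
    recall = [matrix[i][i] / actual_totals[i] for i in range(n)]

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(p * a for p, a in zip(predicted_totals, actual_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, precision, recall, kappa


accuracy, precision, recall, kappa = summarize(CONFUSION)
print(f"accuracy = {accuracy:.2%}, kappa = {kappa:.3f}")
for label, p, r in zip(LABELS, precision, recall):
    print(f"{label:>9}: precision = {p:.2%}, recall = {r:.2%}")
```

Run on these counts, the sketch gives 1235 correct predictions out of 1458 (84.71%) and per-class precision and recall values that match the margins of Table 2, with a kappa of about 0.780.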
In response to the first sub-question, the best algorithm in the four-class mode is Decision Tree C4.5-GI&OSE, with a prediction accuracy of 84.71%. This model showed 84.17% accuracy, 83.42% sensitivity and a kappa of 0.780. The DT C4.5-GI&OSE technique correctly predicted the graduation of 77.88% of excellent students, 85.26% of good students, 84.69% of average students, and 85.95% of poor students.
6- Conclusion
The results show that there is a relationship between students' social and academic characteristics and their academic performance. DT C4.5-GI&OSE was the best algorithm for predicting students' final GPA at the end of their studies, with a prediction accuracy of 84.71%. In this model, the diploma GPA has the greatest effect on the prediction. Used as a decision support tool, machine learning models can help improve the academic level of students and reduce the number of students at risk of failing or dropping out. This study was carried out at the undergraduate level; future research can extend the approach to the master's and doctoral levels.
Keywords: student performance prediction, data mining, machine learning, modeling, improving the quality of education